In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [2]:
a = np.array([-2, 3, 4, -5, 5])
print(a)
Apart from indexing with integers and slices, NumPy also supports indexing with arrays of integers (so-called fancy indexing). For example, to get the 2nd and 4th elements of `a`:
In [3]:
a[[1, 3]]
Out[3]:
To select data fulfilling specific criteria, one can use boolean indexing. This is best illustrated on 1D arrays; for example, let's select only the positive elements of `a`:
In [4]:
a[a > 0]
Out[4]:
Note that the index array has the same size as `a` and is of boolean type:
In [5]:
print(a)
print(a > 0)
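To make the shape and type of the mask explicit, here is a minimal sketch reusing the array `a` from above. The names `mask`, `positives` and `idx` are introduced only for this illustration.

```python
import numpy as np

a = np.array([-2, 3, 4, -5, 5])

# The comparison produces a boolean array with the same shape as `a`.
mask = a > 0
print(mask.dtype, mask.shape)

# Indexing with the mask keeps the elements where the mask is True.
positives = a[mask]
print(positives)

# np.flatnonzero gives the integer positions of the True entries, so
# fancy indexing with those positions selects the same elements.
idx = np.flatnonzero(mask)
print(np.array_equal(positives, a[idx]))
```

Seen this way, a boolean mask and a list of integer positions are two ways of describing the same selection.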
Multiple criteria can also be combined in one query:
In [6]:
a[(a > 0) & (a < 5)]
Out[6]:
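As a hedged sketch of how such combined queries are built: masks are combined with the element-wise operators `&` (and), `|` (or) and `~` (not) rather than the Python keywords `and`, `or` and `not`. The variable names below are illustrative only.

```python
import numpy as np

a = np.array([-2, 3, 4, -5, 5])

# `&` is element-wise AND. The parentheses are needed because `&`
# binds more tightly than the comparison operators.
both = a[(a > 0) & (a < 5)]

# `|` is element-wise OR and `~` negates a mask.
either = a[(a < 0) | (a == 5)]
not_positive = a[~(a > 0)]

print(both, either, not_positive)
```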
A `Series` can be indexed similarly to a 1D NumPy array.
In [9]:
pop_dict = {'Germany': 81.3,
            'Belgium': 11.3,
            'France': 64.3,
            'United Kingdom': 64.9,
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
print(population)
We can use fancy indexing with the rich index:
In [10]:
population[['Netherlands', 'Germany']]
Out[10]:
Similarly, boolean indexing can be used to filter the `Series`. Let's select countries with a population of more than 20 million:
In [11]:
population[population > 20]
Out[11]:
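The sketch below, which rebuilds `population` from the dictionary above, illustrates that the mask is itself a boolean `Series` sharing the same index, and that fancy indexing returns the labels in the order requested. The names `mask`, `mid_sized` and `subset` are assumptions made for the example.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# The mask is a boolean Series with the same index as `population`.
mask = population > 20
print(mask)

# Conditions combine with `&` and `|`, as for NumPy arrays.
mid_sized = population[(population > 20) & (population < 70)]
print(mid_sized)

# Fancy indexing returns the labels in the order they were requested.
subset = population[['Netherlands', 'Germany']]
print(list(subset.index))
```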
You can also do position-based indexing by slicing with integers instead of labels:
In [12]:
population[:2]
Out[12]:
In [13]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
Out[13]:
In [14]:
countries = countries.set_index('country')
countries
Out[14]:
For a `DataFrame`, basic indexing selects the columns.
Selecting a single column:
In [15]:
countries['area']
Out[15]:
or multiple columns using fancy indexing:
In [16]:
countries[['area', 'population']]
Out[16]:
But slicing accesses the rows:
In [17]:
countries['France':'Netherlands']
Out[17]:
We can also select rows similarly to boolean indexing in NumPy. The boolean mask should be 1-dimensional and the same length as the thing being indexed. Boolean indexing of a `DataFrame` can be used like the `WHERE` clause of SQL to select rows matching some criteria:
In [18]:
countries[countries['area'] > 100000]
Out[18]:
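To push the SQL analogy a little further, here is a minimal sketch that rebuilds the `countries` frame from above and combines two conditions. The SQL statement in the comment is only a rough equivalent, and the result names are made up for the example.

```python
import pandas as pd

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data).set_index('country')

# Roughly: SELECT * FROM countries WHERE area > 100000 AND population < 70
large_but_under_70 = countries[(countries['area'] > 100000)
                               & (countries['population'] < 70)]
print(large_but_under_70)

# isin() plays a role similar to SQL's IN (...).
selected = countries[countries['capital'].isin(['Brussels', 'Amsterdam'])]
print(selected)
```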
So as a summary, `[]` provides the following convenience shortcuts:

| | NumPy/`Series` | `DataFrame` |
|---|---|---|
| Integer index `data[label]` | single element | single **column** |
| Slice `data[label1:label2]` | sequence | one or more **rows** |
| Fancy indexing `data[[label1,label2]]` | sequence | one or more **columns** |
| Boolean indexing `data[mask]` | sequence | one or more **rows** |
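The `DataFrame` column of this summary can be checked with a short sketch that rebuilds the `countries` frame used earlier. The variable names on the left-hand side are illustrative only.

```python
import pandas as pd

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data).set_index('country')

column = countries['area']                        # single label -> one column (a Series)
rows = countries['France':'Netherlands']          # slice -> rows
columns = countries[['area', 'population']]       # list of labels -> columns
filtered = countries[countries['area'] > 100000]  # boolean mask -> rows

print(type(column).__name__, rows.shape, columns.shape, filtered.shape)
```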
When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:

- `loc`: selection by label
- `iloc`: selection by position

These indexers address the different dimensions of the frame:

- `df.loc[row_indexer, column_indexer]`
- `df.iloc[row_indexer, column_indexer]`
Selecting a single element:
In [20]:
countries.loc['Germany', 'area']
Out[20]:
But the row or column indexer can also be a list, a slice, a boolean array, etc.:
In [21]:
countries.loc['France':'Germany', ['area', 'population']]
Out[21]:
Selecting by position with `iloc` works similarly to indexing NumPy arrays:
In [22]:
countries.iloc[:2, 1:3]
Out[22]:
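One difference between the two indexers is worth spelling out: label slices with `loc` include both endpoints, while position slices with `iloc` exclude the stop position, as in NumPy. The sketch below rebuilds `countries` to show this. The names `by_label`, `by_position` and `area_pos` are assumptions for the example.

```python
import pandas as pd

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data).set_index('country')

# Label slice: both 'France' and 'Germany' are included.
by_label = countries.loc['France':'Germany']

# Position slice: rows 1 and 2 are selected, row 3 is excluded.
by_position = countries.iloc[1:3]

print(list(by_label.index), list(by_position.index))

# The same single element, addressed by label and by position.
area_pos = countries.columns.get_loc('area')
print(countries.loc['Germany', 'area'], countries.iloc[2, area_pos])
```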
The different indexing methods can also be used to assign data:
In [23]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10
In [24]:
countries2
Out[24]:
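Assignment also works with a boolean mask as the row indexer. The following is a minimal sketch on a copy of `countries`; the threshold and the value `0` are arbitrary choices for illustration.

```python
import pandas as pd

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data).set_index('country')

countries2 = countries.copy()

# One .loc call with a boolean mask for the rows and a label for the column.
small = countries2['area'] < 100000
countries2.loc[small, 'population'] = 0

print(countries2['population'])

# The original frame is untouched because we assigned into a copy.
print(countries['population'])
```

Doing the selection and the assignment in a single `.loc` call avoids chained indexing such as `countries2[small]['population'] = 0`, which may assign into a temporary object instead of `countries2`.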
For the quick ones among you, here are some more exercises with a larger dataframe containing film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for it. You can download these data from here: `titles.csv` and `cast.csv`, and put them in the `/data` folder.
In [30]:
cast = pd.read_csv('data/cast.csv')
cast.head()
Out[30]:
In [31]:
titles = pd.read_csv('data/titles.csv')
titles.head()
Out[31]: